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\ > j"*; ' Having in mind the large-scale analysis of gene regulatory networks, we review a graph decimation 

algorithm, called "leaf-removal", which can be used to evaluate the feedback in a random graph 
ensemble. In doing this, we consider the possibility of analyzing networks where the diagonal of 
£SJ ' the adjacency matrix is structured, that is, has a fixed number of nonzero entries. We test these 

ideas on a network model with fixed degree, using both numerical and analytical calculations. Our 
results are the following. First, the leaf-removal behavior for large system size enables to distinguish 
between different regimes of feedback. We show their relations and the connection with the onset 
of complexity in the graph. Second, the influence of the diagonal structure on this behavior can be 
relevant. 
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I. INTRODUCTION. 

Gene regulatory networks are graphs that represent interactions between genes or proteins. They are the simplest 
way to conceptualize the complex physico-chemical mechanisms that transform genes into proteins and modulate 
1 their activity in space and time. In the network view, all these processes are projected in a static, purely topological 
picture, which is sometimes simple enough to explore quantitatively Thanks to the systematic collection of many 
experimental results in databases, and to new large scale experimental and computational techniques that enable to 
sample these interactions, these graphs are now accessible to a significant extent. Some examples are the undirected 
graphs of protein-protein interactions, and the directed graphs of transcription and metabolic networks 0, 0] • 
■ The availability of such large-scale interaction data is extremely important for post-genomic biology, and has provided 
for the first time a whole-system overview on the global relationships among players in a living system. 
\Q . The hope is to study these graphs together with the available information on the genes and the physics/chemistry 
of their interactions to infer information on the architecture and evolution of living organisms. In this program, the 
simplest possible approach to take is to study the topology of these networks. For instance, order parameters such 
as the connectivity and the clustering coefficient have been considered Other investigators have focused on the 
relations of gene-regulatory graphs with other observables, such as spatial distribution of genes, genome evolution, 



and gene expression 
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F ^ | Typically, in an investigation concerning a topological feature of a biological network, one generates so called 
O" 1 "randomized counterparts" of the original data set as a null model. That is, random networks which conserve some 
topological observables of the original. The main biological question that underlies these studies asks to establish 
when and to what extent the observed biological topology, and thus loosely the living system, deviate from the 
"typical case" statistics. To answer this question, the tools from the statistical mechanics of complex systems are 
, appropriate. For example, a topological feature that has lead to relevant findings is the occurrence of small subgraphs 
— - ■ - or "motifs" 

The study presented here focuses on the topology, and in particular on the problem of evaluating and characterizing 
the feedback present in the network. On a generic biological standpoint, this is an important issue, as it is related to 
the states and the dynamics that a network can exhibit. Roughly speaking, the existence of feedback in the network 
topology is a necessary condition for the dynamics of the network to show multistability and cycles |12| . In presence of 
feedback, the relations between internal variables play an important role, as opposed to situations where the network 
is tree-like, and the external conditions determine completely the configurations and the dynamics. Recently, we came 
to similar conclusions analyzing the structure of the compatible gene expression patterns (fixed points) in a a Boolean 
model of a transcription network . This model exhibits a transition between a regime of simple gene control, and 
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a regime of complex control, where the internal variables become relevant and dynamically non-trivial solutions are 
possible. These regimes correspond to the SAT, and HARD-SAT phases of random-instance satisfiability problems. 
For random Boolean functions, the two regimes can be understood completely in terms of feedback in the network 
topology. A selection of the Boolean functions can change this outcome [l4( . 

Rather than dealing with specific experimental networks, this is meant as a theoretical study on a model graph en- 
semble |23|. Our purpose here is twofold. First, to introduce some "order parameters", i.e. functions that describe the 
relevant feedback properties, connected to algorithms that can be used to evaluate the feedback without enumerating 
the cycles. Second, to study an ensemble of random graphs, or randomization technique, with structured adjacency 
matrices, that conserve the number of entries in their diagonal. This choice, which we will justify, leads to a distinct 
behavior. The two problems are introduced in section [n] We show the connections between different points of view 
on the problem, using simple algebraic, graph theoretical, and statistical mechanical tools. The first approach is an 
application of a decimation algorithm called "leaf-removal" 0, 0] . This algorithm links the feedback to the existence 
of a percolating "core" in the network, containing cycles. The numbers of core variables and edges can then be used 
as order parameters for the feedback. Here, we formulate three variants of the leaf-removal algorithm, and discuss the 
statistical meaning and the relations between them and different levels of feedback. Namely, for an oriented graph, 
one can use these algorithms to define and distinguish "simple" from "complex" feedback. Furthermore, we discuss 
how one can connect feedback to the satisfiability-like optimization problem of counting the solutions of a random 
linear system on the Galois field GF2 17]. This can also be seen as a linear algebra problem concerning the kernel and 
rank of the connectivity matrix. The theoretical motivation for the choice of an ensemble with structured diagonal 
will follow naturally from this discussion. In section ITTT1 we present our main results, as a series of "phase diagrams", 
which describe the typical feedback of random realizations of the graphs. In the unstructured case, the phase diagrams 
obtained by leaf-removal show the existence of five regimes, or "phases" , characterizing the feedback in the limit of 
infinite graph size. Some of these regimes are connected to complexity transitions for the associated random GF2 
optimization problem. Moreover, we show that the choice of a structured diagonal leads to a quantitatively different 
behavior, and thus to a significantly different amount of feedback in the graph. These differences are greatly enhanced 
if the degree distribution is scale- free. 



II. FORMULATION OF THE PROBLEM AND ALGORITHMS 



The problem we want to address consists in evaluating the feedback in a random ensemble of graphs. While the 
range of application is more general, to avoid excess of ambiguity we choose a specific ensemble of graphs that will 
be treated in detail throughout the paper. We consider oriented graphs, where each node has p incoming links. The 
graph ensemble can be specified through a M x 7V Boolean matrix B (having elements or 1). B represents the 
input-output relationships in the network. If Xi are network nodes, Bji = 1 if Xi — * Xj, and zero otherwise. The 
matrix is rectangular because only M < N nodes have an input. We allow for self links, or diagonal elements. For 
a simple directed graph one can say that feedback exists as soon as closed paths of directed edges emerge. Having 
in mind the fact that, while here we consider only topological properties, the incoming links are "inputs", that is, 
they encode for some conditions on the nodes (for example, on gene expression), we can also use a separate graphical 
representation for the nodes, or "variables", and the "functions" regulating these variables. This representation is a 
bipartite graph (Fig. [TJ. Each graph has N variables and M functions, and thus on average 7 = M/N functions per 
variable. 

An important point concerning randomization, is that the choice of what feature to conserve and what not to 
conserve in the randomized counterpart is quite delicate and depends on specific considerations on the system. In 
the words of statistical mechanics, the typical case scenario can vary greatly with the choice of the ensemble. For 
instance, the network motifs shown by randomizing a network with an Erdos-Renyi random graph differ from the 
usual ones, for which the degree sequence is used as a topological invariant 0]. In studies of biological networks, the 
diagonal of B is normally disregarded, or assumed to have the same probability distribution as a row or a column. 
The use of considering it is a matter of the nature of the graph and the property under exam. For the case of 
transcription networks, an ensemble with structured diagonal might have some relevance. For example, for motifs 
discovery, sometimes one puts the diagonal to zero, and considers degree-conserving randomizations that do not 
involve the diagonal l21l. In our earlier work on transcription networks, we have considered the autoregulators as 
a structured diagonal ^j- We will show, for our model graph ensemble, that this leads to considerably different 
results for the feedback. There are other biological examples where a structured adjacency matrix emerges naturally. 
The simplest example are mixed interaction graphs. For instance, one can consider a composition of a transcription 
network with a protein-interaction network (which is a non directed graph) and pose the question of evaluating the 
feedback on a global scale compared to randomized counterparts. 
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FIG. 1: Different representations of interactions in the graph G. Left: oriented graph Right: a bipartite oriented graph. V\ is 
a function and Xi are variables, xo is the output. 

a. Leaf-removal algorithms. A straightforward way to measure the amount of feedback in a graph is to count 
cycles. However, this is in general computationally as costly as enumerating all the paths. For this reason, it is 
desirable to use algorithms and order parameters that allow a quicker evaluation. To this aim, we describe three 
variants of a decimation algorithm, termed "leaf-removal" , that is able to remove the tree-like parts of the graph, 
leaving the components with feedback. We define a leaf as a variable having only incoming links, and a "free" variable, 
or a root, a variable having only outgoing links (FigEJ). 7 is a measure for the fraction of regulated variables, as 
opposed to external variables which only enter functions as inputs. The three variants of the leaf-removal iteratively 
remove links and nodes from the graph, using the following prescriptions (FigEJ. 

1. LRa. Remove leaves and their incoming links. 

2. LRb. As above. Additionally, remove incoming links of nodes whose incoming links are all connected to roots, 
which are also removed. 

3. LRc. As LRa. Additionally, remove all the incoming links (together with their associated nodes) of nodes whose 
incoming links are connected to at least one root. 

This is an iterative nonlinear procedure, where more variables may disappear in a single move. LRc works naturally 
on directed and undirected bipartite graphs. In fact, viewing the system as a bipartite graph, one can verify that LRc 
is equivalent to removing all the functions connected to a single node, ignoring directionality. Instead, LRa and LRb 
are thought for a directed graph, such as the ones we consider here. 

There are two possible outcomes for the leaf-removal. Removing the whole graph, or stopping at a core subgraph 
that contains cycles. The core is composed of Nc genes and Mc functions. We want to use these as order parameters 
for the feedback. Equivalently, we can use Ac = Nc ^. AIc and 7c = Mc /Nc- The difference between LRa and LRb is 
that LRb is able to remove tree-like parts of the graph that arc upstream of a simple cycle. LRc is also able to do this. 
On the other hand, LRc might break some of these cycles because it disregards the orientations of the edges (Fig. El- 
LRc cannot break "complex" cycles, defined as cycles where each node is connected to at least two functions. 

b. Connections with Random Systems in GF2 and Adjacency Matrix Algebra. To investigate the feedback prop- 
erties of the graph, one can also consider the following linear system in the Galois field GF2 (the set {0, 1} with the 
conventional operations of product, and sum modulo 2). 

Ax = v . (1) 

Here, v is a random vector of GF2 M , that represents the functions, and A = B + Imn, where Imn is the truncated 
M x N identity matrix, and the sums are in GF2. In other words, we imagine that each output variable is subject 
to a random XOR constraint, and the idea is to use this as a probe for the feedback. Each XOR constraint, or GF2 
equation corresponds to a function. In the language of statistical mechanics, the random linear system Q maps to 
a p-spin model on the graph 0|. The important point is that feedback translates into algebraic properties of the 
matrix A in GF2, and in solutions of Eq. JJ). A feedback loop, or a cycle, corresponds to the pair A", h°, where A° is 
a, I x I submatrix of A, and h° is a /-component vector such as h°A° = 0. Indeed, the functions and variables selected 
by the nonzero elements of h° are such that each variable appears in an even number of constraints. 
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FIG. 2: Left: example of roots (free variables) and leaves for the leaf-removal algorithm. This graph contains a simple cycle (in 
red), which is not removed by LRa and LRb, but is removed by LRc. Middle: examples of a complex cycle and a hypercycle. 
A complex cycle (top) is not removed by LRc, but does not belong to the kernel of of A*. A hypercycle (bottom) is an element 
of the kernel of A* , because each variable appears in an even number of functions. Right: example of cores for the different 
leaf-removal variants, applied on the same initial graph. The image refers to a random graph with p = 3, 7 = 0. 5, N — 600. 
The cores are represented as a directed graph, and superimposed. The LRa core (whole figure) contains feedback loops and 
tree-like regions (black) upstream of the loops. The LRb core (red) does not contain the treelike parts, but all the feedback is 
preserved. The LRc core is empty, as this algorithm is able to break simple cycles connected to single free variables. The cycle 
of the original graph is indicated by circled nodes and dashed edges (blue) . 



We can also define a "hypercycle" as an M component vector h of GF2, such as the right product hA = 0, because 
the functions and variables selected by the ones in h are such that each variable appears in an even number of 
functions. Graphically, a hypercycle is a connected cluster made of functions that share an even number of nodes 
(Fig.0). From the algebraic point of view, it is an element of the kernel of A 1 , and is then connected to the solvability 
of Eq. . This consideration enables to evaluate the average number Af of solutions of Eq. ^ Perhaps surprisingly, 
one can prove that AT = 2 N ~ M under very general conditions. However, this average ceases to be significant when 
the hypercycles become extensive (i.e., the number of nodes they involve has order N), as the fluctuations become 
dominant. This is discussed in detail in Appendix I A II The exact threshold for 7 where hypercycles become extensive 
is a phase transition in the thermodynamic limit N — > 00, M — ► 00 at constant 7. Precisely, it is called the SAT- 
UNSAT transition for Eq. JJJ ljj. The UNSAT threshold depends on the graph ensemble, and has been determined 
in some cases |2jj. In some instances, there may exist also an intermediate "HARD-SAT" or glassy phase, where 
2 N ~ M solutions exists, but they belong to basins of attractions whose distance from each otherfijj is order N. For 
a p-spin problem on a graph, this glassy phase corresponds to the presence of complex cycles [lrlj . 

c. Structured diagonal. As a hypercycle is a particular realization of a complex cycle, it is easy to understand 
how the core of a leaf-removal algorithm will in general (but not always) contain hypercycles: none of the algorithms 
is able to break these structures. This is shown in Appendix IA 21 which discusses the relation of the leaf-removal 
"moves" with operations on the rows and columns of A, related to the solution of Eq. (JJ. As explained there, for a 
directed graph, the extensive hypercycle, or UNSAT region may exist only at 7 = 1. In the case where the diagonal 
is structured, the situation is quite different, and the hypercycle phase can appear for 7 < 1 [l3l. Il4|. The above 
consideration justifies from an abstract standpoint the intermediate situations, with a fixed fraction of ones on the 
diagonal of A. In considering this ensemble of matrices with structured diagonal, we can introduce an additional 
parameter x, that represents the fraction of ones on the diagonal of A. It is important to note that the introduction of 
a structured diagonal in A changes the adjacency matrix, and thus the graph ensemble. This change can have different 
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interpretations. Rather than focusing on a particular one, the objective here is to show on an abstract standpoint 
how the phase behavior of Eq. Q is perturbed by X- 

III. REGIMES OF FEEDBACK 

In this section, we discuss numerical and analytical results for the leaf-removal algorithms that support the general 
considerations above. We considered mainly the ensemble of graphs with fixed indegree p and Poisson-distributed 
outdegree k, p(k) = e~ pl . The diagonals are thrown with independent probability, to ensure that the average 
fraction of ones is % G [0,1]. The choice of a structured diagonal does not perturb the marginal probability distributions 
of columns or rows. One can connect x to the notion of "orientability" . If M > N, it is impossible to orient a graph 
assigning one single output per function. On the other hand, a graph with a structured diagonal can be seen as 
a partially oriented one, where some directed constraints coexist with some undirected ones. In this interpretation 
X = 1 is the simple directed graph with no self-links. The case x = can be seen as a totally undirected graph, a 
similar ensemble to that used in [16|. 

1400 | 1 , , 1 1 , , , 1 1 1 , , 1 1 , , 1 1 
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FIG. 3: Histogram of the core dimensions Nc and Mc as a function of 7 for LRb. The data refer to 10 4 random networks with 
p — 3 and initial size N = 1000. For low 7, the cores are clustered towards the empty graph. At 7 ~ 0.38 the core distribution 
becomes wide. Successively, the mean values grow and the histogram acquires again a sharp single peak at increasing Ma,No- 
This is reminiscent of a second order phase transition. For LRc, this transition is much sharper (first order), and marks the 
onset of complexity in the core solutions 7^. 

We start with the totally orientable case x = 1- F° r each value of 7, at fixed network size N, one can generate 
randomized graphs and evaluate their cores numerically. This procedure is exemplified in Fig. for the case of LRb. 
The figure shows a transition to a regime where the core is nonempty and all the graphs are sharply distributed around 
the average core size. Equivalently, one can evaluate the core order parameter Ac, which vanishes when the core is 
empty or Mc — Nc- The same order parameter is negative when Mc > Nc- Each LR has two critical values. The 
first, 7I, is associated to the emergence of a nonempty (extensive) core. The second 7^ , to the condition Nc < Mc- 
Based on our results, 7 S is always the same for all three leaf-removals, and corresponds to the onset of the UNSAT 
phase of extensive hypercycles. From simulations and analytical work, 7 S = 1. 7^, instead, depends on the ability to 
remove parts of the graph of the different algorithms. 
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As we have seen, LRa can remove less than LRb, because the latter is able to deal with the tree-like parts of the 
graph that lay upstream of the loops. Also, LRb can remove less than LRc, because LRc can break feedback loops 
if they are connected to a single free variable. Thus, one can expect 7^ < 7^ < 7^. This is indeed our observation 
(Fig.©. 
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FIG. 4: Left: Ac (7) for \ — L P — 3- The solid line corresponds to the analytical calculation ( Appendix IA 3fl . The symbols 
are numerical results for 10 3 realizations of graphs with N = 1000. 7d < 7d < Id mark the transition to an extensive core for 
the three leaf-removal algorithms. 7 S = 1 for all three algorithms to the point where Ac becomes negative. Middle: A scheme 
of the resulting phase diagram. Right: Analytical (N —* 00) values of the order parameter Ac for the LRa algorithm, \ = 1 
and different values of p. The order parameter deviates from zero at the threshold 7JJ = 1/p, and crosses again at 7 C = 1. The 
calculation is described in Appendix I A 31 



Based on these results, we can distinguish the following five regimes of feedback: (1) all cores are empty, (2) only 
the LRa core is nonempty, (3) both the LRa and the LRb core are nonempty, (4) all the cores are nonempty with 
Nc > Mc, (5) all the cores have Nc < Mc- These last two regimes can be seen as thermodynamic phases connected 
with the SAT-UNSAT transition of the associated linear system. 

1. There are no feedback loops in the typical case. 

2. Feedback loops emerge, that form a core having an extensive treelike component upstream. The cycles are 
intensive (i.e. the core contains a number of nodes negligible with respect to N, or o(N)), but the tree upstream 
becomes extensive (O(N)). Analytically, one can compute that 7^ corresponds to the percolation- like threshold 
1/p (see Appendix IA 31 and Fig. 0}. Intuitively, as soon as the graph percolates, even in the presence of a small 
region containing cycles, the tree upstream of the feedback loops can span an extensive part of the graph. 

3. There is an extensive core of simple loops. LRb erases the tree upstream of the feedback loops, thus it can only 
have its threshold when the region of cycles itself becomes extensive. So far, we have not been able to compute 
the threshold 7^ analytically. However, our simulations indicate that it lies higher than 7^ (Fig.0J). 

4. HARD-SAT phase. Intensive hypercycles, and extensive complex cycles form the core, where each variable 
appears in 2 or more functions. This gives a clustered structure to the space of solutions in the corresponding 
random linear system. Ac is proportional to the complexity £ of the space of solutions, defined by the relation 
log A/" ~ N(E + S). Here S, the entropy, measures the width of each cluster, while £ counts the number of 
clusters. 

5. UNSAT phase. The hypercycles become extensive. The threshold 7 S = 1 can be compute analytically (see 
Appendix |A 31 and Fig.0} 

Considering now ensembles with a structured diagonal, one can carry the same analysis at fixed values of 7 and x- 
As we discussed above, LRc is not sensitive to graph orientation, and graphs with a structured diagonal can be seen as 
partially oriented ones. Thus, the simplest choice is to forget the other variants of the algorithms and focus on LRc. At 
fixed X) there are three phases SAT, HARD-SAT, and UNSAT. On the other hand, as we argued above, because of the 
structure of the core matrices, these regions vary with Xi an d a new phase diagram can be generated. The interesting 
result is that this ensemble can show quantitatively different thresholds, while leaving the marginal distributions for 
the row and column connectivity unchanged. We have addressed this question numerically, computing the thresholds 
7d(x) aim 7s (x)- The results for the fixed p ensemble are shown in Fig. [SJ The value for both thresholds increases 
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FIG. 5: Left: Phase diagram for p = 3 and structured diagonals (varying x)- There are quantitative changes with respect to 
X — 1. Middle: Phase diagram for scale-free distribution of the outdegree k. 7^ and j s move with the same trend and undergo 
a notable quantitative drift with increasing \. Right: The exponent for the outdegree distribution is a fit from data on the 
transcription network of E. coli |2l)l . 

with increasing \- In particular, 7s(x) becomes exactly 1 in the directed case. On the other hand, the phenomenology 
of the transition does not vary with x, with a discontinuous jump at the onset of a complex cycles phase, as in a first 
order phase transition. Thus, in the fixed p ensemble, there is a marked quantitative change in the thresholds. One 
may wonder whether the impact is the same for ensembles of graphs where the connectivity distributions are wider. 
Throughout the paper we have considered only the ensemble with fixed p and Poisson distributed k. Notably, the 
effect of a structure diagonal becomes larger for scale-free distributions of k. This is illustrated in Fig. [5] where we 
show the phase diagram for a power-law distribution for k with exponent 1.22 fitted from data from E. coli plj . and 
independently thrown columns for A. In this case, the influence of the diagonal can bring the hypercycle threshold 
7 C down by a factor of three. 

IV. DISCUSSION AND CONCLUSIONS 

We presented a theoretical study focused on the evaluation of feedback and the typical behavior of graphs taken from 
a random ensemble. The study focuses specifically on the ensemble of directed graphs with fixed indegree and Poisson 
outdegree. On the other hand, it is inspired by examples of biological graphs. Detecting feedback in large biological 
graphs and their randomized counterparts is important to understand their functioning. The use of our technique is 
that it allows for a quick evaluation and, more importantly, it provides some quantitative large-scale observables that 
can be used to measure the weight and the complexity of feedback loops. In order to do this, we introduce different 
variants of the leaf-removal algorithm, which naturally carry the definition of simple order parameters, depending on 
the properties of the core. We showed how the three algorithms relate to graph properties, algebraic operations on 
the adjacency matrix, and to solutions of the associated linear systems of equations in GF2. This analysis naturally 
leads to the abstract introduction of structured random graphs that conserve the number of entries in the diagonal 
of the adjacency matrix, which might be relevant in some biological situation. 

Our two main results are the following. First, a phase diagram of different regimes of feedback depending on the 
fraction of free variables for an oriented graph. It shows a quite rich behavior of phase transitions that are interesting 
from the statistical physics viewpoint. These include the thresholds observed in diluted spin systems and XOR-like 
satisfiability problems. As already observed in |16| , the onset of the complex phase is deep in the region where cycles 
exist and they involve a subgraph of the order of the graph size. On the other hand, the less intricate feedback 
regimes of intensive simple cycles connected to extensive trees, and of extensive simple cycles, might be relevant to 
characterize the dynamics in biological instances. The leaf-removal algorithms enable to analyze these different forms 
of feedback, that can be "weaker" than the complex cycles and hypercycles that are relevant for the associated GF2 
problem. The second result is that the introduction of a structured diagonal, which can be interpreted as a partial 
orientation in the graph, has some influence on the thresholds. This is particularly true in presence of scale-free degree 
distribution, where we showed a phase diagram inspired by the connectivity in the E. coli transcription network 21]. 
The algorithms described here can be readily applied to biological data sets and their randomized counterparts. We 
are currently addressing this question in relation with the Darwinian evolution of some transcription and mixed 
transcription- and protein- interaction graphs. Finally, while this analysis is loosely inspired to graphs related to gene 
regulation, the need to evaluate the feedback arises in different contexts, where the tools described here could prove 
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useful. 



APPENDIX A: APPENDIX 



1. Solutions of the Random System in GF2 

Evaluating the average number of solutions of Eq. 2] for large N at constant 7 gives information in the feedback 
of the associated graph. We denote the kernel of a matrix by K, its range by R, and their dimensions by K and p 
respectively. If the probability measure for v is flat, the average number of solutions for fixed A is the probability 
that v G R(A), i.e. 

prob(« e R(A)) = = 2~< A ) , 

times the number of elements in K(A) (i.e. 2 K ^ A \ The average number of solutions is thus 

JJ = ( 2 -«(A*) 2 «(A)^ =2 N-M ) (M) 

where we have used the relations p(A) + k(A) — N, p(A t ) + n(A r ) = M, and p(A) — p{A l ). Moreover, with the same 
reasoning, the fluctuations in the number of solutions are 

W= (F) 2 {2 K( - At i) A , 

meaning that when the average (2^ A ^) A is O(l), an average number of solutions Af = 2 are typically found, 
while this is not the case if {2< A )) A is an extensive quantity. In fact, when this "selfaveraging" property breaks 
down, typically no solutions are found, because Af is supported only by the multiplicity of very rare v € R(A). This 
connects the solvability of the system to the topology of the hypercycles. 

There are phase transitions between the two above regimes, tuned by the order parameter 7. The standard approach 
is to take the thermodynamic limit TV — > 00, M — > 00 at constant 7. These transitions depend on the ensemble of 
graphs considered |20j . 



2. Adjacency Matrix and Leaf-removal. 

Let us try to visualize the leaf-removal procedure, for instance LRc, on a generic adjacency matrix. Consider a 
general Boolean matrix A M x N, and apply LRc. Each time we find a leaf, we assign it and its corresponding 
constraint a progressive number, and we use that number as a label for the rows. With these permutations, we 
construct a hierarchy for the leaves, as the leaves of layer a cannot appear in the clauses of layer b > a. In the tree-like 
case, reordering the lines of A, we obtain 
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where (1) is the set of first layer leaves, (2) the second, etc. The last N — M entries of each row correspond to free 
variables. We have thus obtained a triangulation of A, where the diagonal is made of blocks (the layers) of identity 
matrices. 

In the presence of a core, the triangulation can be carried only until a the core is reached, and the the matrix can 
be rearranged to show the core in the lower right corner. If the core has hypercycles, in the UNSAT phase, the matrix 
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structure is 
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Here, Mc > Nc, so typically it will not possible to find solutions to the core linear system on GF2, or the core does 
not contain sufficient free variables. When the ensemble for A is specified, one has to apply this procedure to all the 
realizations. Naturally, the outcome depends on the matrix ensemble. It also depends in general on the variant of 
leaf-removal that one applies. 

d. Structured diagonal. Focusing on the diagonal of A, we note that in presence of hypercycles, one has necessarily 
to have some zeros in the M x M submatrix of A to realize the condition N c < M c . This can be seen in the sketch 
above, where the diagonal elements are followed by an exclamation mark. In particular, the diagonal contains an 
extensive number of ones. Thus, following the above argument, it is easy to realize that M c < N c , and the hypercycle 
phase may exists only marginally at 7 = 1. For our main choice of ensemble, this is the case, as each variable can 
have only one input, so each constraint can always be labeled by the name of its output variable, which will appear 
as a one in the diagonal of A. In the case where the diagonal contains an extensive number of zeros, the situation is 
quite different, and the hypercycle phase can appear for 7 < 1 00]. 



3. Analytical results for LRa 



We present here the analytical calculation for LRa. If fk is the probability to have k outputs, LRa defines a 
dynamics for it, associated by the cancellations of leaves at each time step. For every time t, one can write 

n = n E/fc(*) ; 

N(t)=N^ k > 1 f k (t) = N(l-f (t)) ; 
M(t)p = N E */*(*) • 

The fraction of nonempty columns is given by the probability 1 — fo(t). Writing the increments as, AN k — N Af k = 
N ^f-At, one can choose At = jj, t 6 [0 : 1], and obtain intensive equations of the kind ^f- — I{t)k,h.f h (t) , where I 
is the matrix that represents the flux generated by a move 22]. 

We now separate A in the blocks S and T of constrained and free variables respectively, writing A — [S\T]. The 
variables that appear in T have an outgoing edge but no incoming ones. S has 7 N columns, while T has (1 — 7) N 
columns. All the rows of A have p ones. The distribution for the ones appearing in the columns, i.e. for the outdegree 
k, is Poisson for both S and T, /fc(0) = ^-e~ A , with A(0) = pj. We impose = 1. The lines of A contain on average 
P7 elements in S and (1 — j)p elements in T, thus after one move there are on average pry + 1 elements in S. Defining 
p' = PI + 1- The flux equations can be written as 



^- = ^m=i[kf^ +1 -(k-l)m; forfc>i 



dff — 1 _j_ p'-l \fS] 

dt - 1 + <fe>s(t)-l W2 J 

dfl _ 

dt ~ ^ ' 



where < k > (t) = E kfk(t). Summing the above equations, one obtains the evolution equation for the normaliza- 
tion factor m := E^fc =< ^ >S 

dm s p' — 1 



k = -pi 



dt m s (t) 

With initial condition m s (0) — A(0) = py, the solution is m s (t) = pj (1 — t). m s {t) can then be identified with \{t) 
appearing in the (Poisson) distribution f k {t). — = ^ ' (t) = pj T^t ' f rom which -^y = [1 — t]. Thus, for k > 1, 

f s _ x(t) A ( t ) fc ' 1 
Jk (fc-1)! ' 
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For k = 1, one can then write gj/i = — 1 — j- (Ae~ A ), so that 

/f(f) = -t + e Mt) = _ t + e P7(t-l) . 

The stop time t* of the algorithm is then a solution of the equation t* = e^** -1 ) . This last equation implies that 
if P7 < 1 the lowest solution for the stop-time is t* = 1, or, in other words, all the graph is removed. On the other 
hand, when pj > 1, there is a finite stop time t* < 1, and thus a core. This determines the critical value 7^ = l/p. 
The size of the portion of the core matrix contained in S is given by Mf top = Ng top — (1 — t*). 

In order to evaluate the full core matrix and the order parameters, the same analysis has to be carried out for the 

,-T 

matrix of the free variables, T. In this case, one has p\ = k ty, ■. , where m T (t) = ^ Again, 



and the flux equations are 



1-7 d_fT _ PU-Tj fT 
7 dtJO — m 'J (t) Jl ' 

1-7 JLf T — P( 1 — l) [O fT __ fT] 
7 aWl ~~ m T (t) ^-'2 ./l J 1 

P7 [ 



The last equation can be rewritten as gj/J = m T( t ) [(fc + l)/fe+i ~ ^/J]- As above, summation yields the evolution 
of the normalization constant ■^m T (t) = —pj. 
Thus, = mT $_ piV which gives 



\(t) T _ m T (0)-pjt 
A T (0) ~ m T (0) 



/o T (A) = e^ . 

In conclusion, the stop time t* is a function of (jry), determined by the relation t* — e P7 ^* The transition 
value to an extensive core is then given by 7^ = l/p. The core dimensions can be written as Mq — N 7(1 — t*), and 
= (1 - 7)(1 - f£) = (1 - 7)(1 - t*). This last quantity gives the core order parameter A c = (1 - 7)(1 - t*). A c 
is zero for 7 < 7^, and becomes nonzero at this critical value, in a continuous, non-differentiable way (with an infinite 
jump). The other threshold is easily calculated, as, for any finite pj, t* > 0, thus j c is given by the prefactor 1 — 7 
in Ac crossing zero and becoming negative: j c = 1. 
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